# Tutorial VI: Recurrent Neural Networks

<p>
Bern Winter School on Machine Learning, 27-31 January 2020<br>
Prepared by Mykhailo Vladymyrov.
</p>

This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

In this session we will see what a recurrent neural network (RNN) is. We will use one to predict and generate text sequences, but the same approach can be applied to any sequential data.


So far we have looked at data that is available all at once. In many cases, however, the data is sequential (weather, speech, sensor signals, etc.).
RNNs are specifically designed for such tasks.
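
To make the idea of a recurrent state concrete, here is a minimal sketch of a single-unit recurrent cell in plain Python. It is an illustration only, not the LSTM used later in this tutorial: the function `rnn_states` and the weights `w_x`, `w_h`, `b` are made-up names and values. The hidden state `h` is updated from the current input and from its own previous value, so the order of the inputs matters.

```python
import math

def rnn_states(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit vanilla RNN over a sequence and return all hidden states."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # the new state depends on the current input and on the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

early = rnn_states([1.0, 0.0, 0.0])  # signal at the first step
late = rnn_states([0.0, 0.0, 1.0])   # same values, signal at the last step
print(early[-1], late[-1])           # the two final states differ
```

Both sequences contain the same values, yet the final states are different: the cell remembers when the signal arrived, with the early signal partly faded by the last step.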

<img src="https://scits-training.unibe.ch/data/figures/rnn.png" alt="drawing" width="90%"/><br>



## 1. Load necessary libraries

In [0]:
# if using google colab
%tensorflow_version 2.x

In [0]:
import sys

import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipyd
import tensorflow as tf
import collections
import time

# We'll tell matplotlib to inline any drawn figures like so:
%matplotlib inline
plt.style.use('ggplot')

from IPython.core.display import HTML
HTML("""<style> .rendered_html code { 
    padding: 2px 5px;
    color: #0000aa;
    background-color: #cccccc;
} </style>""")

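# let TensorFlow allocate GPU memory on demand; skipped when no GPU is present
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)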

## Unpack libraries
If using Colab, run the next cell:

In [0]:
p = tf.keras.utils.get_file('./material.tgz', 'https://scits-training.unibe.ch/data/tut_files/material.tgz')
!mv {p} .
!tar -xvzf material.tgz > /dev/null  2>&1

In [0]:
from utils import gr_disp

## 2. Load the text data

In [0]:
def read_data(fname):
    with open(fname) as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    content = [word for i in range(len(content)) for word in content[i].split()]
    content = np.array(content)
    return content

In [0]:
training_file = 'RNN/rnn.txt'

In [0]:
training_data = read_data(training_file)

In [0]:
print(training_data[:100])

## 3. Build dataset
We will assign an id to each word and build two dictionaries: word->id and id->word.
The most frequent words get the lowest ids.

In [0]:
def build_dataset(words):
    count = collections.Counter(words).most_common()
    dictionary = {}
    for word, _ in count:
        dictionary[word] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary

In [0]:
dictionary, reverse_dictionary = build_dataset(training_data)
vocab_size = len(dictionary)

In [0]:
print(dictionary)

The whole text then becomes a sequence of word ids:

In [0]:
words_as_int = [dictionary[w] for w in training_data]
print(words_as_int)
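
The same encoding can be tried on a toy sentence, which makes the mapping easier to inspect than the full text. This is a small sketch that repeats the logic of `build_dataset`; the sentence and the `toy_*` names are invented for illustration.

```python
import collections

toy_words = "the cat saw the dog and the dog saw the cat".split()

# most_common() sorts by descending count, so frequent words get small ids
toy_dictionary = {}
for word, _ in collections.Counter(toy_words).most_common():
    toy_dictionary[word] = len(toy_dictionary)
toy_reverse = {idx: word for word, idx in toy_dictionary.items()}

# encode the text as ids, then decode it back to words
toy_ids = [toy_dictionary[w] for w in toy_words]
decoded = [toy_reverse[i] for i in toy_ids]

print(toy_dictionary)
print(toy_ids)
```

The most frequent word, `the`, gets id 0 and the rarest, `and`, gets the highest id. Decoding the ids with the reverse dictionary recovers the original word list.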

## 4. Build model

We will build the model in TF2.
It contains an embedding layer and three LSTM layers.
A dense layer on top outputs, for each position of the input sequence, the probabilities of the next word:

In [0]:
# Parameters
n_input = 3  # number of input words used to predict the following word

# number of units in RNN cells
n_hidden = [256, 512, 128]

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, 128, input_length=n_input))

for n_h in n_hidden:
  model.add(tf.keras.layers.LSTM(n_h, return_sequences=True, name='lstm%d' % n_h))

model.add(tf.keras.layers.Dense(vocab_size, activation='softmax'))

model.compile(optimizer='RMSProp',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

W0 = model.get_weights()  # initial weights, to reset the model later with model.set_weights(W0)
model.summary()
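
The parameter counts printed by `model.summary()` can be cross-checked by hand. The sketch below assumes the standard LSTM cell with biases (the Keras default) and uses a placeholder `vocab_size = 1000`, since the real value depends on the text; the helper function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim  # one dim-sized vector per word

def lstm_params(n_in, n_units):
    # four weight blocks (input, forget and output gates, cell candidate),
    # each with input weights, recurrent weights and a bias
    return 4 * (n_in + n_units + 1) * n_units

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights plus biases

vocab_size = 1000  # placeholder, use len(dictionary) for the real data
n_hidden = [256, 512, 128]

counts = [embedding_params(vocab_size, 128)]
n_in = 128  # the embedding output feeds the first LSTM
for n_h in n_hidden:
    counts.append(lstm_params(n_in, n_h))
    n_in = n_h  # each LSTM feeds the next layer
counts.append(dense_params(n_in, vocab_size))
print(counts)
```

With the real `vocab_size`, only the first and last entries change; the three LSTM counts depend only on the layer sizes.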

## 5. Data streaming

Here we will see how to feed a dataset for model training:

In [0]:
# create tf.data.Dataset object
word_dataset = tf.data.Dataset.from_tensor_slices(words_as_int)

In [0]:
# the take method yields the first elements of the dataset:
for i in word_dataset.take(5):
  print(reverse_dictionary[i.numpy()])

The `batch` method creates a dataset that yields sequences of elements:

In [0]:
sequences = word_dataset.batch(n_input+1, drop_remainder=True)

In [0]:
# helper for int-to-text conversion
to_text = lambda arr:' '.join([reverse_dictionary[it] for it in arr])

In [0]:
for item in sequences.take(5):
  print(to_text(item.numpy()))

The `map` method applies an arbitrary function to preprocess each element:

In [0]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]

    return input_text, target_text

dataset = sequences.map(split_input_target)

The model will predict input_text -> target_text:

In [0]:
for input_example, target_example in dataset.take(1):
  print('Input data: ', to_text(input_example.numpy()))
  print('Target data:', to_text(target_example.numpy()))
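
What `batch(n_input+1, drop_remainder=True)` followed by `map(split_input_target)` does to the data can be mimicked with plain Python lists. This is only a sketch for intuition: `chunk` and `split_pair` are hypothetical helpers, and `toy_ids` stands in for `words_as_int`.

```python
def chunk(ids, size):
    """Non-overlapping chunks of `size` items; an incomplete last chunk is dropped."""
    n_full = len(ids) // size
    return [ids[i * size:(i + 1) * size] for i in range(n_full)]

def split_pair(c):
    return c[:-1], c[1:]  # the input, and the same items shifted by one

toy_ids = list(range(10))  # stand-in for words_as_int
n_in = 3                   # plays the role of n_input

pairs = [split_pair(c) for c in chunk(toy_ids, n_in + 1)]
for inp, tgt in pairs:
    print(inp, '->', tgt)
```

Each target is its input shifted by one position, so the model is asked for the next word at every position. The chunks do not overlap, and the last two ids are discarded because they do not fill a complete chunk.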

Finally, we shuffle the items and produce minibatches of 16 elements:

In [0]:
dataset = dataset.shuffle(10000).batch(16, drop_remainder=True)
dataset

Let's test the untrained model:

In [0]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

In [0]:
print('input: ', to_text(input_example_batch.numpy()[0]))
print('output:', to_text(target_example_batch.numpy()[0]))
print('pred:  ', to_text(example_batch_predictions.numpy()[0].argmax(axis=1)))
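
The model is compiled with `sparse_categorical_crossentropy`, which for a single target word is the negative log of the probability assigned to that word. A rough sanity check, sketched under the assumption that an untrained model outputs roughly uniform probabilities: the loss at the start of training should be close to ln(vocab_size). The value `vocab_size = 1000` and the function `sparse_cce` are placeholders for illustration.

```python
import math

def sparse_cce(probs, target_id):
    """Cross-entropy for a single prediction: -log of the target's probability."""
    return -math.log(probs[target_id])

vocab_size = 1000  # placeholder, use len(dictionary) for the real data
uniform = [1.0 / vocab_size] * vocab_size  # every word equally likely
baseline = sparse_cce(uniform, target_id=42)
print(baseline, math.log(vocab_size))
```

An initial loss far above this value may point to a problem in the data pipeline or the model setup.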


## 5. Train!

In [0]:
# model.set_weights(W0)  # uncomment to restart training from the initial weights
history = model.fit(dataset, epochs=200, verbose=1)

In [0]:
def draw_history(hist):
  fig, axs = plt.subplots(1, 2, figsize=(10,5))
  axs[0].plot(hist.epoch, hist.history['loss'])
  if 'val_loss' in hist.history:
    axs[0].plot(hist.epoch, hist.history['val_loss'])
  axs[0].legend(('training loss', 'validation loss'))
  axs[1].plot(hist.epoch, hist.history['accuracy'])
  if 'val_accuracy' in hist.history:
    axs[1].plot(hist.epoch, hist.history['val_accuracy'])

  axs[1].legend(('training accuracy', 'validation accuracy'))
  plt.show()

In [0]:
draw_history(history)

## 6. Generating text with RNN

Take a word sequence and generate the following 128 words:

In [0]:
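def gen_long(model, word_id_arr, n_words=128):
  out = []
  words = list(word_id_arr)  # copy, so the caller's sequence is not modified
  for i in range(n_words):
    # shape (1, n_input): a batch with a single sequence
    keys = np.reshape(np.array(words), [-1, n_input])

    # next-word probabilities at each position: (n_input, vocab_size)
    probs = model(keys).numpy()[0]
    pred_index = probs.argmax(axis=1)
    pred = pred_index[-1]  # prediction following the last input word
    out.append(pred)

    # slide the window: drop the oldest word, append the prediction
    words = words[1:]
    words.append(pred)
  sentence = to_text(out)
  return sentence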
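
The sliding-window loop in `gen_long` can be tried without a trained network. In this sketch the model is replaced by a hypothetical stand-in, `toy_predict`, which returns the sum of the window modulo 10; `generate` is likewise an illustrative name, not part of the tutorial code.

```python
def generate(predict_next, seed, n_words):
    """Greedy generation with a fixed-size sliding window."""
    window = list(seed)
    out = []
    for _ in range(n_words):
        nxt = predict_next(window)
        out.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest item, append the newest
    return out

def toy_predict(window):
    # hypothetical stand-in for the model: next id is the window sum modulo 10
    return sum(window) % 10

print(generate(toy_predict, [1, 2, 3], 5))  # → [6, 1, 0, 7, 8]
```

As in `gen_long`, each prediction is fed back as input, and the choice is deterministic: the same seed always produces the same continuation.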

In [0]:
for input_example_batch, target_example_batch in dataset.take(10):
  input_seq = input_example_batch.numpy()[0]
  sentence = gen_long(model, input_seq)
  print(to_text(input_seq), '...')
  print('\t...', sentence, '\n')

Or type in some text and see how the model continues it:

In [0]:
while True:
    prompt = "%s words: " % n_input

    try:
        sentence = input(prompt)
    except KeyboardInterrupt:
        break

    words = sentence.split()  # split on any whitespace
    if len(words) != n_input:
        print("Please enter exactly %d words" % n_input)
        continue
    try:
        symbols_in_keys = [dictionary[w] for w in words]
    except KeyError:
        print("Word not in dictionary")
        continue

    sentence = gen_long(model, symbols_in_keys)
    print(sentence)


## 7. Exercise


* Run with 5-7 input words instead of 3.
* Increase the number of training iterations, since convergence will take much longer (and so will training!).

## 8. Further reading

[Illustrated Guide to Recurrent Neural Networks](https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9)

[Illustrated Guide to LSTM’s and GRU’s: A step by step explanation](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)